
    Multi-source dataset of e-commerce products with attributes for property matching

    Schema/ontology matching consists of finding matches between types, properties, and entities in heterogeneous data sources in order to integrate them, a task that has become increasingly relevant with the development of web technologies and open data initiatives. One of the tasks involved is data property matching, which tries to find correspondences between the attributes of the entities. This is challenging because equivalent properties may have different names; furthermore, some properties may not be equivalent but still match in 1..n relationships. These difficulties create the need for varied evaluation datasets for two reasons. First, they are needed to evaluate existing techniques in a variety of scenarios. Second, they enable the training of supervised techniques that may even become context-independent if trained with data from diverse enough contexts. To support the evaluation and training of data property matching techniques, we present a collection of datasets consisting of product records from four different contexts. These datasets are the result of transforming two existing datasets; in one of them, some properties were filtered out for being too noisy. The resulting processed dataset consists of JSON files with a listing of the product records and their properties, and a separate grouping of the properties that determines which ones match. It contains information about 2860 entities, with 4386 properties and 13350 pairwise matches.

    Funding: Ministerio de Ciencia, Innovación y Universidades PID2019-105471RB-I00; Junta de Andalucía P18-RT-1060; Junta de Andalucía US-138056
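
    A minimal sketch of how such a dataset might be consumed, assuming a hypothetical JSON layout with a list of records and a grouping of matching properties (the actual field names in the published files may differ):

        import json

        # Hypothetical layout: the actual field names in the published
        # dataset may differ.
        dataset = {
            "records": [
                {"id": "p1", "properties": {"screen size": "6.1 in", "weight": "174 g"}},
                {"id": "p2", "properties": {"display": "6.1 inches", "weight": "174 grams"}},
            ],
            # Each group lists property names that match each other (1..n allowed).
            "property_groups": [["screen size", "display"], ["weight"]],
        }

        def matching_pairs(groups):
            """Expand the grouping of properties into pairwise matches."""
            pairs = set()
            for group in groups:
                for i, a in enumerate(group):
                    for b in group[i + 1:]:
                        pairs.add((a, b))
            return pairs

        print(matching_pairs(dataset["property_groups"]))
        # {('screen size', 'display')}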

    LEAPME: learning-based property matching with embeddings

    Data integration tasks such as the creation and extension of knowledge graphs involve the fusion of heterogeneous entities from many sources. Matching and fusing such entities also requires matching and combining their properties (attributes). However, previous schema matching approaches mostly focus on two sources only and often rely on simple similarity measures, so they face problems in challenging use cases such as the integration of heterogeneous product entities from many sources. We therefore present a new machine learning-based property matching approach called LEAPME (LEArning-based Property Matching with Embeddings) that utilizes numerous features of both property names and instance values. The approach makes heavy use of word embeddings to better capture the domain-specific semantics of both property names and instance values, and supervised machine learning helps exploit the predictive power of these embeddings. Our comparative evaluation against five baselines on several multi-source datasets with real-world data shows the high effectiveness of LEAPME. We also show that our approach remains effective even when training data from another domain is used (transfer learning).

    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2019-105471RB-I00; Junta de Andalucía P18-RT-1060
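
    The core idea can be sketched as follows: represent each candidate property pair by features derived from embeddings of the property names, and train a supervised classifier on labelled pairs. The toy vectors and feature set below are ours for illustration; LEAPME's actual feature catalogue is richer, uses real pre-trained embeddings, and also covers instance values:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        # Toy word vectors standing in for pre-trained embeddings.
        VECS = {"weight": np.array([1.0, 0.0]), "mass": np.array([0.9, 0.1]),
                "colour": np.array([0.0, 1.0]), "color": np.array([0.1, 0.9])}

        def embed(name):
            return VECS.get(name, np.zeros(2))

        def pair_features(a, b):
            # Concatenate both property-name embeddings into one feature vector.
            return np.concatenate([embed(a), embed(b)])

        X = [pair_features("weight", "mass"), pair_features("colour", "color"),
             pair_features("weight", "colour"), pair_features("mass", "color")]
        y = [1, 1, 0, 0]  # 1 = the two properties match

        clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
        print(clf.predict([pair_features("weight", "mass")]))  # expected: [1]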

    A neural network for semantic labelling of structured information

    Intelligent systems rely on rich sources of information to make informed decisions. Using information from external sources requires establishing correspondences between that information and known information classes. This can be achieved with semantic labelling, which assigns known labels to structured information by classifying it according to computed features. Existing proposals have explored different sets of features without focusing on which classification techniques are used. In this paper we present three contributions: first, insights on architectural issues that arise when using neural networks for semantic labelling; second, a novel implementation of semantic labelling that uses a state-of-the-art neural network classifier, which achieves significantly better results than four other traditional classifiers; third, a comparison of the results obtained by the former network when using different subsets of features, comparing textual features to structural ones, and domain-dependent features to domain-independent ones. The experiments were carried out with datasets from three real-world sources. Our results show that there is a need to develop more semantic labelling proposals with sophisticated classification techniques and large feature catalogues.

    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R
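
    A minimal sketch of the underlying setup, assuming hand-picked features and toy data (the paper's feature catalogue and network are far richer); it shows how a neural classifier assigns labels from computed features:

        from sklearn.neural_network import MLPClassifier

        # Hypothetical features computed for each value:
        # [length, digit ratio, whitespace ratio] -- illustrative only.
        X = [[4, 1.00, 0.00],   # "2024"       -> year
             [4, 1.00, 0.00],   # "1999"       -> year
             [10, 0.00, 0.10],  # "John Smith" -> person
             [8, 0.00, 0.12]]   # "Jane Doe"   -> person
        y = ["year", "year", "person", "person"]

        clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
        clf.fit(X, y)
        print(clf.predict([[4, 1.00, 0.00]]))  # expected: ['year']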

    TAPON: a two-phase machine learning approach for semantic labelling

    Through semantic labelling we enrich structured information from sources such as HTML pages, tables, or JSON files with labels to integrate it into a local ontology. This process involves measuring some features of the information and then finding the classes that best describe it. The problem with current techniques is that they do not model relationships between classes, so their features fall short when some classes have very similar structures or textual formats. To deal with this problem, we have devised TAPON: a new semantic labelling technique that computes novel features that take these relationships into account. TAPON computes them by means of a two-phase approach: in the first phase, we compute simple features and obtain a preliminary set of labels (hints); in the second phase, we inject our novel features and obtain a refined set of labels. Our experimental results show that our technique, thanks to our rich feature catalogue and novel modelling, achieves higher accuracy than other state-of-the-art techniques.

    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R
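
    A sketch of the two-phase idea under simplifying assumptions (the Node structure, features, and classifiers below are hypothetical stand-ins): phase one labels each node from its own features, and phase two re-labels it using its neighbours' hints as extra features:

        from dataclasses import dataclass, field

        @dataclass
        class Node:
            id: str
            simple_features: list
            neighbours: list = field(default_factory=list)

        def label_two_phase(nodes, simple_clf, refined_clf):
            # Phase 1: preliminary labels (hints) from each node's own features.
            hints = {n.id: simple_clf(n.simple_features) for n in nodes}
            # Phase 2: re-label with relational features, i.e. the sorted
            # hints of each node's neighbours.
            return {n.id: refined_clf(n.simple_features,
                                      sorted(hints[m.id] for m in n.neighbours))
                    for n in nodes}

        # Two "name" fields with identical simple features are disambiguated
        # by what surrounds them (a price vs. a postal address).
        simple = lambda f: {"short-text": "name", "number": "price",
                            "long-text": "address"}[f[0]]
        refined = lambda f, hints: (("product_name" if "price" in hints
                                     else "person_name")
                                    if f[0] == "short-text" else simple(f))

        n1, p = Node("n1", ["short-text"]), Node("p", ["number"])
        n2, a = Node("n2", ["short-text"]), Node("a", ["long-text"])
        n1.neighbours, n2.neighbours = [p], [a]
        print(label_two_phase([n1, p, n2, a], simple, refined))
        # {'n1': 'product_name', 'p': 'price', 'n2': 'person_name', 'a': 'address'}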

    TAPON-MT: a versatile framework for semantic labelling

    Semantic labelling refers to the problem of assigning known labels to the elements of structured information from a source such as an HTML table or an RDF dump with unknown semantics. In recent years it has become progressively more relevant due to the growth of structured information available on the Web of Data that needs to be labelled in order to integrate it into data systems. The existing approaches for semantic labelling have several drawbacks that make them unappealing, if not impossible to use, in certain scenarios: not accepting nested structures as input, being unable to label structural elements, not being customisable, requiring groups of instances when labelling, requiring instances to be matched to named entities in a knowledge base, not detecting numeric data, or not supporting complex features. In this article we propose TAPON-MT, a machine learning framework for semantic labelling. Our framework does not have the former limitations, which makes it domain-independent and customisable. We have implemented it with a graphical interface that eases the creation and analysis of models, and we offer a web service API for their application. We have also validated it with a subset of the National Science Foundation awards dataset, and our conclusion is that TAPON-MT creates labelling models that are effective and efficient in practice.

    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R
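
    As an illustration of accepting nested structures and labelling structural elements, the following sketch decomposes a nested record into labelable nodes; the decomposition and field names are our assumptions, not TAPON-MT's actual internals:

        # Hypothetical decomposition of a nested record into labelable nodes,
        # covering both structural elements and leaf values.
        def nodes(value, path="$"):
            if isinstance(value, dict):
                yield (path, "structural", None)
                for k, v in value.items():
                    yield from nodes(v, f"{path}.{k}")
            elif isinstance(value, list):
                yield (path, "structural", None)
                for i, v in enumerate(value):
                    yield from nodes(v, f"{path}[{i}]")
            else:
                yield (path, "leaf", value)

        award = {"title": "Study of X", "amount": 500000,
                 "investigators": [{"name": "J. Smith"}]}
        for node in nodes(award):
            print(node)  # each node is a candidate to receive a label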

    Cloud Configuration Modelling: a Literature Review from an Application Integration Deployment Perspective

    Enterprise Application Integration has played an important role in providing methodologies, techniques, and tools to develop integration solutions, aiming at reusing current applications and supporting the new demands that arise from the evolution of business processes in companies. Cloud computing is part of a new reality in which companies have at their disposal a high-capacity IT infrastructure at low cost, on which integration solutions can be deployed and run. The charging model adopted by cloud computing providers is based on the amount of computing resources consumed by clients. Such demand for resources can be computed either from the implemented integration solution or from the conceptual model that describes it. It would be desirable for cloud computing providers to supply detailed conceptual models describing the variability of their services and the restrictions between them; however, this is not the case, and providers do not supply the conceptual models of their services. The conceptual model of services is the basis for developing a process, and providing supporting tools, for decision-making on the deployment of integration solutions to the cloud. In this paper, we review the literature on cloud configuration modelling and compare current proposals based on a comparison framework that we have developed.

    A Transducer Model for Web Information Extraction

    In recent years, many authors have paid attention to web information extractors. They usually build on an algorithm that interprets extraction rules inferred from examples. Several rule learning techniques are based on transducers, but none of them proposed a generic transducer model for web information extraction. In this paper, we propose a new transducer model that is specifically tailored to web information extraction. The model has proven quite flexible, since we have adapted three techniques from the literature to infer state transitions, and the results prove that it can achieve high precision and recall rates.

    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E
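
    A minimal sketch of a transducer in this spirit: states consume input tokens and transitions decide what to emit. The transition table below is handcrafted for illustration; the techniques discussed in the paper infer such transitions from examples:

        # Each transition is (source state, guard, target state, emit?).
        def run_transducer(tokens, transitions, start="skip"):
            state, output = start, []
            for tok in tokens:
                # Take the first transition whose guard accepts the token.
                for (src, guard, dst, emit) in transitions:
                    if src == state and guard(tok):
                        state = dst
                        if emit:
                            output.append(tok)
                        break
            return output

        transitions = [
            ("skip", lambda t: t == "<b>", "in_name", False),
            ("in_name", lambda t: t == "</b>", "skip", False),
            ("in_name", lambda t: True, "in_name", True),  # emit name tokens
            ("skip", lambda t: True, "skip", False),
        ]
        print(run_transducer(["<p>", "<b>", "Acme", "Laptop", "</b>", "</p>"],
                             transitions))
        # ['Acme', 'Laptop']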

    AYNEC: All you need for evaluating completion techniques in knowledge graphs

    The popularity of knowledge graphs has led to the development of techniques to refine them and increase their quality. One of the main refinement tasks is completion (also known as link prediction for knowledge graphs), which seeks to add missing triples to the graph, usually by classifying potential ones as true or false. While there is a wide variety of graph completion techniques, there is no standard evaluation setup, so each proposal is evaluated using different datasets and metrics. In this paper we present AYNEC, a suite for the evaluation of knowledge graph completion techniques that covers the entire evaluation workflow. It includes a customisable tool for the generation of datasets with multiple variation points related to the preprocessing of graphs, the splitting into training and testing examples, and the generation of negative examples. AYNEC also provides a visual summary of the graph and the optional export of the datasets in an open format for their visualisation. We use AYNEC to generate a library of ready-to-use evaluation datasets based on several popular knowledge graphs. Finally, it includes a tool that computes relevant metrics and uses significance tests to compare each pair of techniques. These open-source tools, along with the datasets, are freely available to the research community and will be maintained.

    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R
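
    As an example of one variation point, negative examples are often generated by corrupting true triples; the sketch below implements one such strategy (corrupting the target entity), which is our illustrative choice rather than AYNEC's exact procedure:

        import random

        # Corrupt the target entity of each true triple, avoiding known
        # positives; AYNEC offers several configurable strategies, of which
        # this is only a simplified example.
        def corrupt_targets(triples, seed=0):
            rng = random.Random(seed)
            entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
            positives = set(triples)
            negatives = []
            for (s, p, o) in triples:
                o2 = rng.choice([e for e in entities if e != o])
                if (s, p, o2) not in positives:
                    negatives.append((s, p, o2))
            return negatives

        kg = [("seville", "locatedIn", "spain"), ("paris", "locatedIn", "france")]
        print(corrupt_targets(kg))  # e.g. [('seville', 'locatedIn', 'paris'), ...]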

    MostoDEx: A tool to exchange RDF data using exchange samples

    The Web is evolving into a Web of Data in which RDF data are becoming pervasive and are organised into datasets that share a common purpose but have been developed in isolation. This motivates the need for complex integration tasks, which are usually performed using schema mappings; generating them automatically is appealing because it relieves users from the burden of handcrafting them. Many tools are based on the data models to be integrated: classes, properties, and constraints. Unfortunately, many data models in the Web of Data comprise very few or no constraints at all, so relying on constraints to generate schema mappings is not appealing. Other tools rely on handcrafting the schema mappings, which is not appealing at all. A few other tools rely on exchange samples but require user intervention, or are hybrid and require constraints to be available. In this article, we present MostoDEx, a tool to generate schema mappings between two RDF datasets. It uses a single exchange sample and a set of correspondences, but does not require any constraints to be available or any user intervention. We validated and evaluated MostoDEx using many experiments that prove its effectiveness and efficiency in practice.

    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
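
    To make the inputs concrete, the following toy sketch applies a set of correspondences to a source sample term by term; actual mapping generation in MostoDEx, which exploits the exchange sample to derive executable mappings, is considerably more involved, and all identifiers below are ours for illustration:

        # Correspondences relate source terms to target terms; the exchange
        # sample shows source data. Applying them term by term is only our
        # simplification of what a generated mapping would achieve.
        correspondences = {"src:Person": "tgt:Author", "src:name": "tgt:fullName"}

        def translate(triples, corr):
            return [tuple(corr.get(term, term) for term in t) for t in triples]

        sample = [("ex:alice", "rdf:type", "src:Person"),
                  ("ex:alice", "src:name", "Alice")]
        print(translate(sample, correspondences))
        # [('ex:alice', 'rdf:type', 'tgt:Author'),
        #  ('ex:alice', 'tgt:fullName', 'Alice')]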

    A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

    Virtual integration systems require a crawling tool able to navigate and reach relevant pages on the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose at each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making it efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages without prior intervention from the user.

    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E
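
    The key idea can be sketched with a toy, rule-based stand-in for the trained classifier: relevance is decided from URL features alone, so irrelevant pages are never downloaded. The URL patterns below are hypothetical:

        from urllib.parse import urlparse

        # Decide relevance from the URL alone, before downloading. This
        # handcrafted rule stands in for the learned classifier described
        # in the paper.
        def url_segments(url):
            return [s for s in urlparse(url).path.lower().split("/") if s]

        def looks_relevant(url):
            return any(s.startswith("product") or s.isdigit()
                       for s in url_segments(url))

        frontier = ["https://shop.example/products/123",
                    "https://shop.example/about-us",
                    "https://shop.example/product/456-laptop"]
        to_download = [u for u in frontier if looks_relevant(u)]
        print(to_download)  # the about-us page is never fetched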